Using Parallel Computing and Grid Systems for Genetic Mapping of Quantitative Traits
Authors: Mahen Jayawardena, Kajsa Ljungberg and Sverker Holmgren

Abstract

We present a flexible parallel implementation of the exhaustive grid search algorithm for multidimensional QTL mapping problems. A generic parallel algorithm is presented, and a two-level scheme is introduced for partitioning the work corresponding to the independent computational tasks in the algorithm. At the outer level, a static block-cyclic partitioning is used, and at the inner level a dynamic pool-of-tasks model is used. The parallelism at the outer level is implemented using scripts, while MPI is used at the inner level. By comparing results from the SweGrid system to those obtained using a shared memory server, we show that this type of application is highly suitable for execution in a grid framework.

Key words: QTL analysis, grid computing

1 Genetic Mapping of Quantitative Traits

Many important traits in animals and plants are quantitative in nature. Examples include body weight and growth rate, susceptibility to infections and other diseases, and agricultural crop yield. Hence, understanding the genetic factors behind quantitative traits is of great importance. The regions in the genome affecting a quantitative trait can be found by analysis of the genetic composition of individuals in experimental populations. These genetic regions are known as Quantitative Trait Loci (QTL), and the procedure of finding them is called QTL mapping. A review of QTL mapping methods is given in [13].

QTL mapping exploits a statistical model for how the genotypes of the individuals in the population affect the trait. The data for the model is produced by experiments where the genotypes are determined at a set of marker loci in the genome. This data is input to a QTL mapping computer code, where the computation of the model fit and the search for the most probable positions of the QTL are implemented using numerical algorithms. Once the most probable QTL is determined, further computations are needed to establish the statistical significance of the result.

It is generally believed that quantitative traits are governed by an interplay between multiple QTL and environmental factors. However, using a model where multiple QTL are searched for simultaneously makes the statistical analysis very computationally demanding. Finding the most likely position of d QTL influencing a trait corresponds to a d-dimensional global optimization problem, where the objective function is evaluated by computing the statistical model fit for a given set of d QTL positions in the genome. So far, standard QTL mapping software packages [6,7,18,25] have used an exhaustive grid search for solving the global optimization problem. This type of algorithm is robust, but the computational requirement grows exponentially with d. As a result, often only a single QTL (d = 1) can be mapped easily. Models with multiple QTL are normally fitted using, for example, a forward selection technique where a sequence of one-dimensional exhaustive searches is performed. In this type of procedure, the identified QTL are successively included as known quantities when searching for additional QTL. However, it is not clear how accurate this technique is for general QTL models.

Lately, the interest in simultaneous mapping of multiple QTL has increased. The interest is partly motivated by analyses of real data sets [12, 26, 27] where certain interactions [11] between pairs of QTL have been found to be detectable only by solving the full two-dimensional optimization problem.

2 Efficient Computational Schemes for QTL Analysis

A popular approach for computing the model fit is the linear regression method [14,17,23], where a single least-squares problem is solved for each objective function evaluation.
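To make the cost of one objective function evaluation more concrete, the following is a minimal sketch of a least-squares model fit for a toy model with an intercept and a single regressor, followed by a one-dimensional exhaustive search over candidate positions. It is an illustration under assumptions, not the solver used by the authors: the function names, the single-regressor model and the dictionary of per-position regressors are all hypothetical, and real QTL models use several regression indicators derived from marker data.

```python
def residual_sum_of_squares(phenotypes, regressor):
    """RSS of the least-squares fit y = b0 + b1 * a (toy model, one regressor)."""
    n = len(phenotypes)
    if n != len(regressor) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 entries")
    mean_y = sum(phenotypes) / n
    mean_a = sum(regressor) / n
    s_aa = sum((a - mean_a) ** 2 for a in regressor)
    s_ay = sum((a - mean_a) * (y - mean_y) for a, y in zip(regressor, phenotypes))
    s_yy = sum((y - mean_y) ** 2 for y in phenotypes)
    if s_aa == 0.0:
        # Constant regressor: only the intercept can be fitted.
        return s_yy
    # Closed-form RSS for simple linear regression.
    return s_yy - s_ay ** 2 / s_aa


def grid_search_1d(phenotypes, regressors_by_position):
    """Exhaustive search: one least-squares problem per candidate position.

    regressors_by_position is a hypothetical mapping from a grid position
    to the regressor values of all individuals at that position.
    """
    return min(
        regressors_by_position,
        key=lambda pos: residual_sum_of_squares(phenotypes, regressors_by_position[pos]),
    )
```

The search returns the position with the smallest residual sum of squares. For d simultaneous QTL the same evaluation would be repeated at every point of a d-dimensional grid, which is where the exponential growth in work comes from.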
Efficient numerical algorithms for solving these least-squares problems in the QTL mapping setting are considered in [19, 20].

Because the work grows exponentially when the standard exhaustive grid search is used, algorithms for the global optimization problem for simultaneous mapping of multiple QTL have received special attention. Previously used optimization methods for QTL mapping problems include a genetic optimization algorithm, implemented for d = 2 using library routines [8], and an algorithm based on the DIRECT [16] scheme, implemented for d = 2 and d = 3 [21] and later improved to include a more efficient local search [22]. For multidimensional QTL searches, these new optimization algorithms are many orders of magnitude faster than an exhaustive grid search. The results indicate that the new algorithms in [22] enable simultaneous mapping of up to six QTL (d = 6) using a standard computer.

The purpose of the work presented in the rest of this paper is three-fold:

1. We want to be able to perform at least a few high-dimensional QTL mapping computations using the very costly exhaustive grid search. For real experimental data, we do not know the true optimal QTL positions a priori. Using results from exhaustive grid searches for representative data sets and models, we can evaluate the accuracy (and efficiency) of the more elaborate optimization methods mentioned above.
2. The extreme computational cost of exhaustive grid searches for high-dimensional QTL mapping problems makes it necessary to implement a parallel computer code. This code will later also provide a basis for implementing the more efficient optimization schemes in a variety of high performance computing environments.
3. The structure of the multidimensional QTL search indicates that it can be efficiently implemented in a computational grid environment.
Using a flexible parallel implementation, it is possible to investigate whether this conjecture is valid.

3 Parallelization of Search Algorithms for Multiple QTL

In Section 5, we describe a flexible parallel implementation of a scheme for simultaneous mapping of several QTL, using the linear regression statistical method and an exhaustive grid search for finding the best model fit. More details about the problem setting can be found, for example, in [21, 22]. In the experiments presented in Section 6, we use the parallel code to search for potential QTL positions in data from an experimental intercross between European wild boars and white domestic pigs consisting of 191 individuals [5]. The pig genome has 18 chromosomes, and its total length is approximately 2300 cM. In the computations we use models with two and three QTL, including both marginal and epistatic effects.

In this paper, we use this set of data and models as representative examples. We do not consider the relevance of the models used, nor do we consider the problem of which statistical model to use. We also do not attempt to establish the statistical significance of the results, and we do not draw any genetic conclusions from the computations. However, the code described in this paper provides a basis for future studies of all these issues, and for performing complete QTL mapping analyses using models including many QTL.

The search for the best QTL model fit should in principle be solved by optimizing over all positions x in a d-dimensional hypercube whose side is given by the size of the genome. The genome is divided into C chromosomes, so the search space hypercube consists of a set of C^d d-dimensional, unequally sized chromosome combination boxes, or cc-boxes. A cc-box can be identified by a vector of chromosome numbers c = [c1 c2 ... cd], and consists of all x for which xj is a point on chromosome cj.
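The cc-box definition above can be illustrated with a short sketch that maps a vector of genome-wide positions to the cc-box containing it. This is only an illustration under assumptions: the three chromosome lengths are made up (they are not the pig linkage map), and the function names and the convention that positions are measured along the concatenated genome are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical chromosome lengths in cM (illustrative values only).
CHROMOSOME_LENGTHS = [120.0, 80.0, 100.0]
# Cumulative end position of each chromosome along the concatenated genome.
CHROMOSOME_ENDS = list(accumulate(CHROMOSOME_LENGTHS))


def chromosome_of(position):
    """Return the 1-based number of the chromosome containing a position."""
    if not 0.0 <= position < CHROMOSOME_ENDS[-1]:
        raise ValueError("position outside the genome")
    return bisect_right(CHROMOSOME_ENDS, position) + 1


def cc_box_of(x):
    """Return the cc-box c = (c1, ..., cd) that contains the point x."""
    return tuple(chromosome_of(xj) for xj in x)


def number_of_cc_boxes(d):
    """Without using any symmetry, the hypercube splits into C**d boxes."""
    return len(CHROMOSOME_LENGTHS) ** d
```

With these assumed lengths, `cc_box_of([50.0, 130.0, 250.0])` identifies the box spanned by chromosomes 1, 2 and 3, and a two-QTL search space consists of nine boxes.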
The ordering of the loci does not affect the model fit, and this symmetry can be used to reduce the search space. We can restrict the search to cc-boxes identified by non-decreasing sequences of chromosomes. In addition, in cc-boxes where two or more edges span the same chromosome, for example c = [1 8 8], we need only consider a part of the box. Since genes on different chromosomes are unlinked, the objective function is normally discontinuous at the cc-box boundaries. This means that the QTL search can be viewed as essentially consisting of n ≈ C^d/d! (n ≈ C^2/2 for d = 2) independent global optimization problems, one for each cc-box included in the search space. This fact must also be taken into account when using more advanced optimization algorithms. For example, derivative information cannot be used across a cc-box boundary.

Footnote 1: A standard unit of genetic distance is the Morgan [M]. However, distances are often reported in centi-Morgan [cM].
Footnote 2: Marginal effects are additive, i.e. the combined effect from two loci equals the sum of the individual effects. For epistatic effects, the relationship is nonlinear.

This partitioning of the problem is a natural basis for a straightforward parallelization of multidimensional QTL searches:

    do (in parallel) i=1:n
        l_sol(i) = global_optimization(cc-box(i));
    end
    Find the global solution among l_sol(:);

The final (serial) operation only consists of comparing n objective function values, and the work is negligible compared to the work performed within the parallel loop. This type of parallelization was also used in [9] for mapping of single QTL. Since the objective function evaluations (model fit computations) are rather expensive, the work for performing global minimization in a cc-box is almost exclusively determined by the number of calls to the objective function evaluation routine. For an exhaustive grid search algorithm, this number is known a priori.
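The parallel loop over cc-boxes can be sketched in Python as follows. This is a hedged illustration of the structure only, not the authors' implementation, which uses scripts and MPI: the thread pool stands in for the parallel loop (in CPython it gives no speed-up for CPU-bound work), the names `reduced_cc_boxes`, `search_cc_box`, `parallel_search` and `grid_sizes` are hypothetical, and restricting boxes with repeated chromosomes to non-decreasing grid positions is one assumed way of considering only part of such a box.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations_with_replacement, product


def reduced_cc_boxes(num_chromosomes, d):
    """cc-boxes identified by non-decreasing sequences of chromosome numbers."""
    return list(combinations_with_replacement(range(1, num_chromosomes + 1), d))


def search_cc_box(box, grid_sizes, objective):
    """Exhaustive grid search in one cc-box: (best value, best point, box)."""
    axes = [range(grid_sizes[c]) for c in box]
    best = None
    for point in product(*axes):
        # Assumed symmetry rule: where two neighbouring edges span the same
        # chromosome, keep only points with non-decreasing positions.
        if any(box[j] == box[j + 1] and point[j] > point[j + 1]
               for j in range(len(box) - 1)):
            continue
        value = objective(box, point)
        if best is None or value < best[0]:
            best = (value, point, box)
    return best


def parallel_search(num_chromosomes, d, grid_sizes, objective, workers=4):
    """Independent searches per cc-box, then a cheap serial comparison."""
    boxes = reduced_cc_boxes(num_chromosomes, d)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        local_solutions = list(
            pool.map(lambda b: search_cc_box(b, grid_sizes, objective), boxes)
        )
    # Final serial step: compare the n local solutions.
    return min(local_solutions, key=lambda s: s[0])
```

For the 18 pig chromosomes, `reduced_cc_boxes(18, 2)` contains 171 boxes and `reduced_cc_boxes(18, 3)` contains 1140, in line with the estimate n ≈ C^d/d!. Here `grid_sizes` maps each chromosome number to its number of grid points, and `objective(box, point)` is a user-supplied model fit routine.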
However, since the sizes of the cc-boxes vary considerably (the chromosomes have very different lengths), the work will be very different for different boxes. Hence, the two main issues when implementing the algorithm above are load balancing and granularity for the parallel loop. We use an equidistant 1 cM grid for the exhaustive grid search, and exploit the symmetry of the search space to reduce the number of grid points. In Table 1, the total number of objective function evaluations and the total number of cc-boxes, n, for our pig data example are given for searches using models including different numbers of QTL, d. For this example, the number of objective function evaluations for the different cc-boxes ranges from 400 to about 25000.

4 Computational Grids and Other Parallel Computer Systems

Grid computing has been a buzzword in the computing community for some years, and numerous research projects involving grid systems and grid computations have been initiated. Still, a common view by computational science

Table 1. The total number of objective function evaluations for different numbers of QTL in the model.
Columns: number of QTL (d), cc-boxes (n), function evaluations